Warm-up (2-3 minutes)
What is something that you are good at doing?
How did you get good at the thing you are good at doing?
About Me…
Ph.D. in Statistics from Montana State University
B.S. in Mathematics & B.B.A. in Economics from Colorado Mesa University
Every analysis we will do assumes a structure like:
output = f(input) + (noise)
…or, if you prefer…
response variable(s) = f(explanatory variables) + (noise)
dependent variable(s) = f(independent variables) + (noise)
target = f(predictors) + (noise)
In any case: we are trying to reconstruct information in data, and we are hindered by random noise.
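For concreteness, here is a quick simulation of this structure in R; the particular f and noise level are invented purely for illustration:

# output = f(input) + noise, with a made-up linear f and Gaussian noise
set.seed(551)
n <- 100
x <- runif(n, 0, 10)           # inputs
f <- function(x) 2 + 0.5 * x   # the "true" f (unknown to us in practice)
y <- f(x) + rnorm(n, sd = 1)   # outputs: signal plus random noise
plot(x, y)                     # the noise obscures the signal...
curve(f, add = TRUE)           # ...which is what we are trying to reconstruct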
The function f might be very simple…
\(Y = X + \epsilon\)
…or very complex
\(z_i = b_0 + b_1 x_i\)
\(q_i = \frac{1}{1 + \exp(-z_i)}\)
\(y_i \sim \text{Bern}(q_i)\)
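To make the "complex" case concrete, here is a small R simulation of exactly this model; the coefficient values (b_0 = -2, b_1 = 0.8) are arbitrary picks for illustration:

# Simulating the logistic (Bernoulli) model above
set.seed(551)
n <- 200
x <- runif(n, 0, 5)
z <- -2 + 0.8 * x                    # z_i = b_0 + b_1 x_i
q <- 1 / (1 + exp(-z))               # q_i = 1 / (1 + exp(-z_i))
y <- rbinom(n, size = 1, prob = q)   # y_i ~ Bern(q_i)
head(data.frame(x, q, y))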
You will often hear the topics in this class described as "machine learning."
My opinion:
Statistical learning is more concerned with the model structure, interpretation of estimates, and understanding error.
Machine learning is more concerned with model implementation and computational demands.
Often, the nature of our models will differ depending on the types of data involved!
regression
the response variables are quantitative
classification
the response variables are categorical
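One quick R illustration of the two settings, using the built-in mtcars data (the variable choices here are mine, purely for illustration):

# Regression: quantitative response (miles per gallon)
fit_reg <- lm(mpg ~ wt, data = mtcars)
# Classification: categorical response (automatic vs. manual transmission)
fit_cls <- glm(am ~ wt, data = mtcars, family = binomial)
coef(fit_reg)
coef(fit_cls)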
supervised learning: our data includes observations of the output variable
What drug treatments are associated with better disease outcomes?
unsupervised learning: our data does NOT include any observations of the output variable
What social groups already exist among the Stat 551 students?
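A sketch of both settings in R, using the built-in iris data as a stand-in example (the variables are illustrative, not course material):

# Supervised: the output (Petal.Length) is observed, so we model it directly.
fit_sup <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)
coef(fit_sup)

# Unsupervised: ignore the labels entirely and look for groups in the inputs.
set.seed(551)
fit_unsup <- kmeans(iris[, 1:4], centers = 3)
table(fit_unsup$cluster, iris$Species)  # did the found groups recover Species?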
So, why do we care about estimating f?
prediction: We are trying to use new inputs to guess future outputs (sketched in R below).
Which articles are Dr. T. most likely to read?
inference: We are trying to tell a story about the relationship between variables.
Which genes are more activated when breast cancer is present?
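Both goals can start from the same fitted model; what differs is what we do with it. A minimal R sketch, again on mtcars (illustrative variables only):

fit <- lm(mpg ~ wt + hp, data = mtcars)

# Prediction: feed in new inputs, get a guess for the output.
predict(fit, newdata = data.frame(wt = 3.0, hp = 120))

# Inference: read the estimated relationships (and their uncertainty) off the fit.
summary(fit)$coefficients
confint(fit)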
It is important to think carefully about:
Assumptions: What do various models assume to be true about the data structure? Are these justified?
Interpretations: What can we learn by estimating f for a particular model? Is that information what we are looking for?
Estimation: How is each f being approximated? Will this be a close approximation?
Usage: What are we going to do once we estimate f? Do certain models lend themselves better than others?
If we are doing prediction, we mostly don’t care about assumptions.
The “best” model is the model that predicts most accurately.
But: What measure of accuracy do we prioritize?
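For example, two common measures that can disagree about which model predicts "best" (the toy numbers here are invented):

rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))  # punishes large misses heavily
mae  <- function(y, y_hat) mean(abs(y - y_hat))       # treats all misses linearly
y     <- c(1, 2, 3, 4, 100)   # one extreme observation
y_hat <- c(1, 2, 3, 4, 50)
rmse(y, y_hat)                # dominated by the single large error
mae(y, y_hat)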
If we are doing inference, we care a lot about assumptions.
The “best” model is the one that matches the truth.
But: What the heck is the truth???
You will learn:
To apply many different models to real data using R.
To interpret the output of these model estimates.
To use cross-validation to compare models (a minimal sketch follows this list).
To explain the general structure and philosophy behind each model.
To select an appropriate “best” model for a data analysis, and make a well-reasoned argument for your choice.
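As a preview of the cross-validation goal above, here is a minimal k-fold sketch in R; the data (mtcars) and the two candidate models are placeholders, not course material:

set.seed(551)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # randomly assign rows to folds
cv_rmse <- function(formula) {
  errs <- sapply(1:k, function(i) {
    fit  <- lm(formula, data = mtcars[folds != i, ])      # train on k-1 folds
    pred <- predict(fit, newdata = mtcars[folds == i, ])  # predict the held-out fold
    sqrt(mean((mtcars$mpg[folds == i] - pred)^2))
  })
  mean(errs)
}
cv_rmse(mpg ~ wt)       # simpler candidate model
cv_rmse(mpg ~ wt + hp)  # richer candidate; lower CV error favors it for prediction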